The Redundancy of a Computable Code 
on a Noncomputable Distribution 
Abstract: We introduce new definitions of universal and
superuniversal computable codes, which are based on a code's 
ability to approximate Kolmogorov complexity within the pre- 
scribed margin for all individual sequences from a given set. 
Such sets of sequences may be singled out almost surely with 
respect to certain probability measures. 

Consider a measure parameterized with a real parameter and 
put an arbitrary prior on the parameter. The Bayesian measure 
is the expectation of the parameterized measure with respect to 
the prior. It appears that a modified Shannon-Fano code for any 
computable Bayesian measure, which we call the Bayesian code, 
is superuniversal on a set of parameterized measure-almost all 
sequences for prior-almost every parameter. 

According to this result, in the typical setting of mathemat- 
ical statistics no computable code enjoys redundancy which is 
ultimately much less than that of the Bayesian code. Thus we 
introduce another characteristic of computable codes: The catch- 
up time is the length of data for which the code length drops 
below the Kolmogorov complexity plus the prescribed margin. 
Some codes may have smaller catch-up times than Bayesian 
codes. 



I. What is a good computable code? 

Giving a reasonable definition of a good general-purpose
compression algorithm is very important, not so much for
practical data compression as for the theoretical analysis
of statistical inference and machine learning. All parameter
estimation or prediction algorithms
can be transformed into compression algorithms via the idea 
of the plug-in code [1], [2], see also [3, Section 6.4.3]. 
A transformation in the opposite direction can be done for 
prediction, with a guaranteed standard risk in the iid case [3, 
Proposition 15.1-2]. With certain restrictions, the better we 
compress the better we predict. 
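
The plug-in idea can be illustrated with a minimal sketch. It assumes a binary alphabet, ideal (real-valued) code lengths in bits, and the Laplace estimator as the predictor; these choices and the name `plugin_code_length` are illustrative assumptions, not taken from [1], [2].

```python
import math

def plugin_code_length(bits):
    """Ideal code length, in bits, of a plug-in code for a binary string.

    Each symbol is encoded with the probability predicted from the
    symbols seen so far (Laplace estimator, an assumption of this sketch).
    """
    ones = 0
    total = 0.0
    for i, b in enumerate(bits):
        # Laplace estimate of P(next symbol = 1) after i observed symbols.
        p_one = (ones + 1) / (i + 2)
        p = p_one if b == 1 else 1.0 - p_one
        total += -math.log2(p)
        ones += b
    return total

x = [1, 1, 0, 1, 1, 1, 0, 1]
print(plugin_code_length(x))
```

For this particular estimator the accumulated length equals the minus log-probability of the uniform-prior Bernoulli mixture, so a better predictor directly yields a shorter code.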

This article proposes a new simple theoretical framework for 
computable universal compression of random data (and thus 
for their prediction). Our results lie between the idealized algo- 
rithmic statistics [4], [5] and the present MDL perspective on 
mainstream statistical inference [6], [3], [7]. We offer a clearer 
path to understanding what good compression procedures are 
when the predicted data are generated by very complicated 
probability measures, cf. [8]. 

The prefix Kolmogorov complexity $K(x)$ is the length of
a code for a string $x$ which we can never beat by more than
a constant when we use computable prefix codes [9, Chapter
3].^1 Consequently, our theoretical evaluation of compression
algorithms will be based on the Kolmogorov redundancy
$|C(x)| - K(x)$ rather than on the traditional Shannon redundancy

$$|C(x)| + \log P(x), \tag{1}$$

where $C$ is the inspected computable prefix code and
$P \in \mathcal{M}$ is one of many candidate distributions for the data.

A large body of literature has been devoted to studying
codes that are minimax optimal with respect to (1), exactly
[10], [11], [12] or asymptotically [6], [3]. Let us notice that
if the minimax expected Shannon redundancy

$$\min_C \sup_{P \in \mathcal{M}} \mathbf{E}_{x \sim P} \left[ |C(x)| + \log P(x) \right] \tag{2}$$

or the minimax regret

$$\min_C \sup_{P \in \mathcal{M}} \max_x \left[ |C(x)| + \log P(x) \right] \tag{3}$$

are finite, plausibly bounded in terms of the data length, and
achieved by a unique code $C$, then the corresponding minimax
properties appear a plausible rationale to argue for the
optimality of code $C$ on data typical of a class of
distributions $\mathcal{M}$.
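
For a concrete finite instance of the minimax regret, consider the following hedged sketch. It assumes that the class consists of iid Bernoulli distributions on binary strings of a fixed length n and that code lengths are ideal (real-valued, in bits). Under these assumptions the minimax regret is known to equal the logarithm of the Shtarkov sum of maximized likelihoods; the function name is hypothetical.

```python
import math

def bernoulli_minimax_regret(n):
    """log2 of the Shtarkov sum for iid Bernoulli strings of length n."""
    total = 0.0
    for k in range(n + 1):
        # Maximized likelihood of any string with k ones:
        # (k/n)^k * ((n-k)/n)^(n-k), equal to 1 when k is 0 or n.
        ml = 1.0
        if 0 < k < n:
            ml = (k / n) ** k * ((n - k) / n) ** (n - k)
        # There are C(n, k) strings with k ones.
        total += math.comb(n, k) * ml
    return math.log2(total)

for n in (1, 10, 100):
    print(n, bernoulli_minimax_regret(n))
```

The value is finite for every n and grows slowly with n, which is the situation where the minimax criterion singles out a code.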

Things change when (2) or (3) are infinite since then every
code is a minimizer. Infinite or unbounded minimax values 
appear in fact in many statistical models: (i) There are no 
universal redundancy rates for stationary ergodic processes 
[13]. (ii) Even in the parametric iid case, like Poisson or 
geometric, one often has to restrict the parameter range to 
a compact subset to have a reasonable minimax code [3, 
Theorem 7.1 and Sections 11.1.1-2]. In a surprising contrast, 
the redundancy for computable parameters can be very small, 
which is known as superefficient estimation/compression [14], 
[15], [16]. 

The minimax values (2) or (3) may be infinite because
there is no worst case of data rather than no intuitively good 
code. Often there exists an intuitively good code but to single 
it out with the minimax criterion, we have to modify the 

^1 To fix our notation, the prefix code $C : \mathbb{X}^+ \to \mathbb{Y}^+$
encodes strings over a countable alphabet $\mathbb{X}$ as strings over
a finite alphabet $\mathbb{Y} = \{0, 1, \dots, D-1\}$ and $\log$ is the
logarithm to the base $D$. The prefix Kolmogorov complexity
is considered with respect to the computer which accepts programs only
from a prefix-free subset of $\mathbb{Y}^+$,
$\sum_x D^{-K(x)} \le \Omega < 1$. We call code $C$ computable if both
$C$ and the inverse mapping $C^{-1}$ can be computed by the computer.
$|C(x)|$ is the length of $C(x)$.



score (1) with some penalty. This idea has emerged in the
MDL statistics in recent years. Grünwald [3, Sections 11.3
and 11.4] reviewed a number of proposed heuristic penalties,
which he called the "luckiness functions" or conditional NML
(normalized maximum likelihood). In general, the penalties
have the form $\mathcal{L}(P, x)$, so the function subject to
the minimax is

$$|C(x)| + \log P(x) - \mathcal{L}(P, x). \tag{4}$$

Now, an important simple new idea. Typically for mathe- 
matical statistics, P is noncomputable (in the absolute sense). 
For instance, it may be given by an analytic formula with 
an algorithmically random parameter, to be estimated from 
the observed data x rather than known beforehand. On the 
other hand, the code C that we are searching for must be 
computable. We owe this insight to Vovk [17], who writes: 
The purpose of estimators is to be used for comput- 
ing estimates, and so their computability is essential. 
Accordingly, in our discussion we restrict ourselves 
to computable estimators. 

A parameter point is not meant to be computed by 
anybody. Depending on which school of statistics 
we listen to, it is either a constant chosen by Nature 
or a mathematical fiction. 
Consequently, the baseline $-\log P(x)$ in the coding game
(1) should be replaced by something uniformly closer to
the smallest code length that we can achieve by effective
computation. The prefix Kolmogorov complexity $K(x)$ seems
a fortunate candidate since

$$|C(x)| \ge K(x) - K(C^{-1}), \tag{5}$$

where $K(C^{-1})$ is the length of any program to decode $C$,
i.e., to compute $C^{-1}$. When designing a general-purpose
compressor $C$, one usually wants to keep $K(C^{-1})$ small.
We should subtract the generic luckiness function

$$\mathcal{L}(P, x) := K(x) + \log P(x) \tag{6}$$

from the criterion (1) before the minimax is applied since
otherwise we punish an intuitively good code $C$ for unlearnable
idiosyncrasies and nonuniformity of the data. The luckiness (6)
does not depend on code $C$ and its expectation is nonnegative.
As we will elaborate in Section II, this very $\mathcal{L}(P, x)$
is close in several senses to the algorithmic information
$I(P : x)$ about $x$ in $P$.

We conjecture that $I(P : x)$ can grow for noncomputable $P$
very fast in terms of the data length $|x|$, like any function
$o(|x|)$ even in the iid case, cf. [8]. The order of the growth
depends not only on the "parametric class" of $P$ that
statisticians like to think of but also on the exact
"displacement" of algorithmic randomness in the possibly infinite
definition of $P$. For instance, if $P$ is computable given
a computable parameter value then $I(P : x)$ is bounded by the
finite Kolmogorov complexity of $P$ in view of the symmetry of
algorithmic information [9, Theorem 3.9.1]. This bound can be
also associated with the existence of a computable superefficient
estimator of the parameter [16], [15], [17].



Although $K(x)$ is noncomputable and we cannot evaluate
the value of $K(x)$ for any particular string $x$, we can obtain
sufficiently good estimates of Kolmogorov complexity for
strings typical of certain probability measures. This observation
inspires our new individualistic definitions of universal
and superuniversal codes, which avoid minimax altogether. In
the following, italic $x, y, \dots \in \mathbb{X}^+$ are strings
(of finite length), boldface $\mathbf{x}, \mathbf{y}, \dots \in
\mathbb{X}^\infty$ are infinite sequences, and calligraphic
$\mathcal{X}, \mathcal{S}, \dots \subset \mathbb{X}^\infty$ are
subsets of these sequences. Symbol $x_n$ denotes the $n$-th symbol
of $\mathbf{x}$ and $x^n$ is the prefix of $\mathbf{x}$ of
length $n$: $\mathbf{x} = x_1 x_2 x_3 \dots$,
$x^n = x_1 x_2 \dots x_n$. Consequently:

Definition 1.1 (universal codes): Code $C$ is called
$(\mathcal{X}, o(f(n)))$-universal if it is a computable prefix
code and $\lim_{n \to \infty} \left[ |C(x^n)| - K(x^n) \right] / f(n) = 0$
holds for all $\mathbf{x} \in \mathcal{X}$.

Definition 1.2 (superuniversal codes): Code $C$ is called
$(\mathcal{X}, f(n))$-superuniversal if it is a computable prefix
code and $|C(x^n)| - K(x^n) \le f(n)$ holds $n$-ultimately for all
$\mathbf{x} \in \mathcal{X}$. Phrase "$n$-ultimately" is an
abbreviation of "for all but finitely many $n \in \mathbb{N}$".

Although Definitions 1.1 and 1.2 reinterpret several probabilistic
concepts of code universality that have been contemplated
by Grünwald [3, pages 183, 186, and 200], only two specific
kinds of known codes fall under these definitions.

The codes discovered first are $(\mathcal{S}, o(n))$-universal
codes for sequences typical of certain stationary measures, such
as the LZ code and many similar ones [18], [19], [8], [20]. Namely,
for each stationary probability measure $\mathbf{P}$ over a finite
alphabet there exists a set $\mathcal{S}_{\mathbf{P}}$ of infinite
sequences such that $\mathbf{P}(\mathcal{S}_{\mathbf{P}}) = 1$ and
LZ is $(\mathcal{S}_{\mathbf{P}}, o(n))$-universal.^2 Consequently,
we may put $\mathcal{S} = \bigcup_{\mathbf{P} \in \mathbb{S}}
\mathcal{S}_{\mathbf{P}}$, where $\mathbb{S}$ is the set of all
such measures.

There exists also a second kind of good codes, which
consists of superuniversal codes for sequences typical of
computable measures. For each computable measure $\mathbf{P}$ there
exists a specific set $\mathcal{B}_{\mathbf{P}}$ of infinite
sequences such that $\mathbf{P}(\mathcal{B}_{\mathbf{P}}) = 1$
and a simple modification of the computable Shannon-Fano
code is $(\mathcal{B}_{\mathbf{P}}, |c(n)| + 1)$-superuniversal.^3
In the case of $\mathbf{P}(x) = \int \mathbf{P}_\theta(x) \, d\pi(\theta)$,
we call this code the Bayesian code with
respect to $(\{\mathbf{P}_\theta\}, \pi, c)$.

Consider the case when $\mathbf{P}$ is computable whereas
$\mathbf{P}_\theta$ is not necessarily so. We will see easily in
Section III that $\mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) = 1 =
\mathbf{P}(\mathcal{B}_{\mathbf{P}})$ for $\pi$-almost all $\theta$.
This simple

^2 Our notation for distributions and measures follows the distinction
between strings and infinite sequences. Italic $P$ is a distribution of
countably many strings $x$ with $P(x) \ge 0$ and $\sum_x P(x) = 1$.
Boldface $\mathbf{P}$ is also a distribution of strings $x$,
$\mathbf{P}(x) \ge 0$, but normalized against strings of
fixed length, $\sum_x \mathbf{P}(x) \mathbf{1}\{|x| = n\} = 1$, and
satisfying the consistency condition
$\sum_y \mathbf{P}(xy) \mathbf{1}\{|y| = 1\} = \mathbf{P}(x)$.
Consequently there is a unique measure on the measurable sets of
infinite sequences $\mathbf{x}$, also denoted as $\mathbf{P}$, such that
$\mathbf{P}(\{\mathbf{x} : x^n = x \text{ for } n = |x|\}) = \mathbf{P}(x)$.

^3 We use symbol $c : \mathbb{N} \to \mathbb{Y}^+$ to denote a computable
prefix code for natural numbers, $\sum_n D^{-|c(n)|} \le 1$. For example,
$c(n)$ may be chosen as the recursive $\omega$-representation for $n$
[21]. Then $|c(n)| = \log^* n + 1$, where $\log^* n$ is the iterated
logarithm of $n$ to the base $D$. A different $c(n)$ may be
convenient for a study of superefficient compression, cf. [16]. By an
analogy to the distinction between $P$ and $\mathbf{P}$, we propose
symbol $\mathbf{C}$ to denote a system of computable prefix codes for
strings of fixed length. The corresponding Kraft inequalities are
$\sum_x D^{-|C(x)|} \le 1$ versus
$\sum_x D^{-|\mathbf{C}(x)|} \mathbf{1}\{|x| = n\} \le 1$.

Each code of form $c(|x|)\mathbf{C}(x)$ is a prefix code for strings of
any length but the converse is not true.



statement establishes in fact the ultimate near-optimality of
Bayesian codes with respect to $(\{\mathbf{P}_\theta\}, \pi, c)$ also
for data typical of many simple noncomputable probability measures.
The statement appears very powerful since we can let
$\mathbf{P}_\theta$ be any parameterized measures considered by
statisticians for years. To mention a few examples, we may consider
iid Bernoulli, Poisson or discretized long-range dependent
Gaussian time series. The result also explains why the MDL
statistics has so resembled Bayesian inference so far.

The motivation for Bayesian codes in the MDL statistics
lies in the concept of the shortest effective description rather
than in beliefs. Thus, in the MDL paradigm we can go farther
and ask what computable codes are significantly shorter than
a fixed Bayesian code.^4 Because of the $K(n)$-high oscillations
of Kolmogorov complexity [9, Sections 2.5.1 and 3.4], one
may hardly expect that there exist $(\mathcal{X}, f(n))$-superuniversal
codes for $f(n) = o(K(n)) + O(1)$.

The ultimate redundancy does not seem a performance score 
that can be improved on if we can only define a computable 
Bayesian code for the contemplated statistical problem. This 
notwithstanding, another performance score can be attacked. 

Definition 1.3 (catch-up time): The catch-up time for an
$(\mathcal{X}, f(n))$-superuniversal code $C$ is the function
$\mathrm{CUT}(\,\cdot\,; C) : \mathbb{X}^\infty \to \mathbb{N} \cup \{\infty\}$
defined as

$$\mathrm{CUT}(\mathbf{x}; C) := \sup \{ n \in \mathbb{N} : |C(x^n)| - K(x^n) > f(n) \}.$$

The catch-up time is the minimal length of data beyond which the
code stays almost as good as the Kolmogorov complexity.
A simple lower bound for the catch-up time can be obtained
by comparing two computable codes experimentally. Based
on the data provided by [7], we conjecture that some codes
have much smaller catch-up times than Bayesian codes.
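
The experimental lower bound can be sketched as follows. For any computable reference code C' we have K(x) <= |C'(x)| + K(C'^{-1}), so every n with |C(x^n)| - |C'(x^n)| - K(C'^{-1}) > f(n) is a lower bound for the catch-up time of C. In the sketch below the function name, the argument `decoder_cost` (any upper bound on K(C'^{-1}), for instance the length of a concrete decoder program) and the toy numbers are hypothetical.

```python
def cut_lower_bound(len_c, len_ref, f, decoder_cost):
    """Largest n with len_c[n-1] - len_ref[n-1] - decoder_cost > f(n).

    len_c[n-1] and len_ref[n-1] are the code lengths of the prefix x^n
    under the inspected code and under the reference code.
    Returns 0 when no prefix length satisfies the inequality.
    """
    bound = 0
    for n, (a, b) in enumerate(zip(len_c, len_ref), start=1):
        if a - b - decoder_cost > f(n):
            bound = n
    return bound

# Hypothetical code lengths for prefixes of length 1..4.
inspected = [10, 12, 13, 14]
reference = [5, 9, 12, 14]
print(cut_lower_bound(inspected, reference, lambda n: 2, decoder_cost=0))
```

Since `decoder_cost` only needs to be an upper bound on the true decoding complexity, overestimating it makes the bound smaller but keeps it valid.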

In the remaining part of this article, we detail the mentioned
results. In Section II we argue that the generic luckiness
function is close to algorithmic information. In Section III we
prove that Bayesian codes are superuniversal for data typical
of almost all parameter values. Some ideas for future research
are sketched in the concluding Section IV.

Our framework differs in several points from what has been
done in the algorithmic and MDL statistics. Firstly, we insist
on computable codes but apply both Kolmogorov complexity
and noncomputable probability measures to evaluate the
quality of the code. Secondly, we apply a stronger version
of Barron's "no hypercompression" inequality to upper bound
the code length in question with the Kolmogorov complexity.
So far Barron's inequality was only used to lower bound the
code length with minus log-likelihood.

^4 We consider here only computable Bayesian inference. It has been known
that $K(x)$ equals the length of a certain noncomputable code having
a Bayesian interpretation [9, Example 4.3.3 and Theorem 4.3.3].



II. A generic "luckiness" function

We will argue in this section that the generic luckiness
function $\mathcal{L}(P, x) := K(x) + \log P(x)$ is close to the
algorithmic information about $x$ in $P$. First of all, let us
recall the necessary concepts:

(i) The universal computer is a finite state machine that
interacts with one or more infinite tapes on which only
a finite number of distinct symbols may be written in
each cell. For convenience, we allow three tapes: tape
$\alpha$ on which a finite program is written down, tape
$\beta$ (oracle) on which an infinite amount of additional
information can be provided before the computations
are commenced, and tape $\gamma$ from which the output of
computations is read once they are finished. We assume
that programs which the computer accepts on tape $\alpha$
form a prefix-free set of strings.

(ii) To compute strings over an alphabet that is larger (e.g.
countably infinite) than the alphabet allowed on tape
$\gamma$, we assume that the contents of $\gamma$ are sent to
a fixed decoder once the computations are finished.

(iii) The prefix Kolmogorov complexity $K(x)$ of a string
$x$ is the length of the shortest program on tape $\alpha$ to
generate the representation of string $x$ on tape $\gamma$ when
the computer does not read from tape $\beta$.

(iv) The conditional prefix Kolmogorov complexity $K(x|y)$
is the length of the shortest program on tape $\alpha$ to
generate the representation of string $x$ on tape $\gamma$ when
the representation of object $y$ is given on tape $\beta$.

(v) The representation of an arbitrary distribution $P$ on tape
$\beta$ is a list of probabilities $\lfloor P(x) D^m \rfloor D^{-m}$
discretized up to $m$ digits, enumerated for all strings $x$ and
all precision levels $m$. (The same applies to a measure
$\mathbf{P}$ respectively.)

(vi) If the function $(x, m) \mapsto \lfloor P(x) D^m \rfloor D^{-m}$
can be computed by a program then we put $K(P)$ to be the length
of the shortest such program and call $P$ computable. If
$P$ is not computable, we let $K(P) := \infty$.

The old idea of Shannon-Fano coding [22, Section 5.9]
thus yields the following proposition:

Theorem 2.1: [5, the proof of Lemma II.6] For a computer-dependent
constant $A$,

$$K(x|P) + \log P(x) \le A, \tag{7}$$
$$\mathbf{E}_{x \sim P} \left[ K(x|P) + \log P(x) \right] \ge 0. \tag{8}$$

Constant $A$ is the length of any program on tape $\alpha$ which
computes $x$ given the mapping $y \mapsto P(y)$ put on tape $\beta$ and
$x$'s Shannon-Fano codeword of length $\lceil -\log P(x) \rceil$ appended
on tape $\alpha$ after the program. Inequality (8) is the noiseless
coding theorem for entropy and an arbitrary prefix code.
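
A prefix code with codeword lengths equal to the ceiling of minus log-probability can be built explicitly. The sketch below uses one classical construction (Shannon's cumulative-probability method) and assumes a binary code alphabet, D = 2, and a finite distribution with positive probabilities; the function names are hypothetical.

```python
import math

def shannon_code(probs):
    """Map each symbol to a binary codeword of length ceil(-log2 p)."""
    code = {}
    cum = 0.0
    # Process symbols in order of decreasing probability.
    for sym, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        length = math.ceil(-math.log2(p))
        # Codeword = first `length` bits of the binary expansion of cum.
        bits, frac = "", cum
        for _ in range(length):
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        code[sym] = bits
        cum += p
    return code

def is_prefix_free(words):
    """Check that no codeword is a prefix of another one."""
    ws = sorted(words)
    # In sorted order a prefix is always followed directly by an extension.
    return all(not b.startswith(a) for a, b in zip(ws, ws[1:]))

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = shannon_code(probs)
print(code, is_prefix_free(code.values()))
```

A decoder only needs the same mapping from symbols to probabilities to invert the code, which is the role of the oracle tape in Theorem 2.1.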

The version of (7) for measure $\mathbf{P}$ requires an additional
term to identify the string length. Now constant $A$ becomes
the length of a program on tape $\alpha$ which computes $x$ given
the mapping $y \mapsto \mathbf{P}(y)$ put on tape $\beta$, the
prefix-free representation of the string length $n = |x|$ appended on
tape $\alpha$ after the program, and $x$'s Shannon-Fano codeword of
length $\lceil -\log \mathbf{P}(x) \rceil$ appended on tape $\alpha$
after that. As the prefix-free representation of $n$, we choose the
shortest program to generate $n$. The length of this program is
denoted as $K(n)$. For any computable code $c$ for natural numbers,
we have also

$$K(n) \le K(c^{-1}) + |c(n)|,$$

where $K(c^{-1})$ is the length of any program to decode $c$.
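
One standard example of a computable prefix code for natural numbers is the Elias omega code. The sketch below assumes D = 2 and positive integers; whether this is exactly the code the author has in mind for c is an assumption, and the function names are hypothetical.

```python
def elias_omega_encode(n):
    """Elias omega codeword of a positive integer n, as a bit string."""
    if n < 1:
        raise ValueError("n must be positive")
    code = "0"  # terminating zero
    while n > 1:
        b = bin(n)[2:]       # binary expansion of n, starts with '1'
        code = b + code      # prepend the current group
        n = len(b) - 1       # next, encode the group length minus one
    return code

def elias_omega_decode(bits):
    """Decode one codeword from the front of bits; return (n, rest)."""
    n, i = 1, 0
    while bits[i] == "1":
        group = bits[i:i + n + 1]  # a group of n + 1 bits
        i += n + 1
        n = int(group, 2)
    return n, bits[i + 1:]         # skip the terminating zero

word = elias_omega_encode(17) + elias_omega_encode(3)
print(elias_omega_decode(word))
```

Because the code is prefix-free, codewords can be concatenated with other data and still be decoded unambiguously, which is what the program on the program tape relies on when the string length is written before the Shannon-Fano codeword.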
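Theorem 2.2: For a computer-dependent constant $A$,

$$K(x^n|\mathbf{P}) + \log \mathbf{P}(x^n) \le A + K(n), \tag{9}$$
$$\mathbf{E}_{\mathbf{x} \sim \mathbf{P}} \left[ K(x^n|\mathbf{P}) + \log \mathbf{P}(x^n) \right] \ge 0. \tag{10}$$

Moreover,

$$K(x^n|\mathbf{P}) + \log \mathbf{P}(x^n) \ge 0 \tag{11}$$

$n$-ultimately for $\mathbf{P}$-almost all sequences $\mathbf{x}$.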

Inequality (11) stems from a slightly stronger version of Barron's
inequality than given in [23, Theorem 3.1]:

Lemma 2.3 (Barron's "no hypercompression" inequality):
Let $W$ be a prefix code for strings of any length, not
necessarily computable. Then

$$|W(x^n)| + \log \mathbf{P}(x^n) \ge 0 \tag{12}$$

$n$-ultimately for $\mathbf{P}$-almost all sequences $\mathbf{x}$.
Remark: We may put $|W(x)| := K(x \mid \text{anything fixed})$ or
$|W(x)| := K(m) + K(x|f(m))$, where $m$ depends on $x$ in
whatever way.

Proof: Consider function $Q(x) = D^{-|W(x)|}$. By the
Markov inequality,

$$\mathbf{P}(\text{(12) is false}) = \mathbf{P}\left( \frac{Q(x^n)}{\mathbf{P}(x^n)} > 1 \right) \le \mathbf{E}_{\mathbf{x} \sim \mathbf{P}} \frac{Q(x^n)}{\mathbf{P}(x^n)} \le \sum_x \mathbf{1}\{|x| = n\} \, Q(x).$$

Hence $\sum_n \mathbf{P}(\text{(12) is false}) \le \sum_x Q(x) \le 1 < \infty$
by the Kraft inequality. Consequently, the claim follows by the
Borel-Cantelli lemma. ■
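
The Markov step of this proof can be spelled out in more detail. The following derivation is a sketch that reads the claim of Lemma 2.3 as the inequality $|W(x^n)| + \log \mathbf{P}(x^n) \ge 0$ and writes $Q(x) = D^{-|W(x)|}$ as in the proof; the restriction of the sum to strings of positive probability is made explicit here as an assumption about how the expectation is evaluated.

```latex
\begin{align*}
\mathbf{P}\bigl(|W(x^n)| + \log \mathbf{P}(x^n) < 0\bigr)
  &= \mathbf{P}\Bigl(\tfrac{Q(x^n)}{\mathbf{P}(x^n)} > 1\Bigr)
   \le \mathbf{E}_{\mathbf{x} \sim \mathbf{P}} \frac{Q(x^n)}{\mathbf{P}(x^n)} \\
  &= \sum_{x :\, |x| = n,\ \mathbf{P}(x) > 0}
       \mathbf{P}(x)\,\frac{Q(x)}{\mathbf{P}(x)}
   \le \sum_{x :\, |x| = n} Q(x), \\
\sum_{n} \mathbf{P}\bigl(|W(x^n)| + \log \mathbf{P}(x^n) < 0\bigr)
  &\le \sum_{x} D^{-|W(x)|} \le 1 .
\end{align*}
```

Since the probabilities of the failure events are summable over $n$, the Borel-Cantelli lemma gives that, with probability one, the inequality fails for only finitely many $n$. The sum over all $n$ is finite precisely because $W$ is a single prefix code for strings of every length.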
Let us recall that the algorithmic information about $x$ in $P$
is

$$I(P : x) := K(x) - K(x|P) \ge 0 \tag{13}$$

[9, Definition 3.9.1]; the last inequality holds without any
additive constant for our definition of universal computer.
A slightly different definition of symmetric algorithmic
information $I(x; y)$ is sometimes also convenient [5, Eq. II.3]. As
a corollary of Theorems 2.1 and 2.2 we obtain bounds for the
luckiness term $\mathcal{L}$, which read

$$\mathcal{L}(P, x) - I(P : x) \le A,$$
$$\mathbf{E}_{x \sim P} \left[ \mathcal{L}(P, x) - I(P : x) \right] \ge 0,$$
$$\mathcal{L}(\mathbf{P}, x^n) - I(\mathbf{P} : x^n) \le A + K(n),$$
$$\mathbf{E}_{\mathbf{x} \sim \mathbf{P}} \left[ \mathcal{L}(\mathbf{P}, x^n) - I(\mathbf{P} : x^n) \right] \ge 0,$$

whereas $\mathcal{L}(\mathbf{P}, x^n) - I(\mathbf{P} : x^n) \ge 0$
$n$-ultimately for $\mathbf{P}$-almost all sequences $\mathbf{x}$.
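
The step from the theorems to these bounds is a one-line identity. The following worked equation is a sketch that uses only the definition of the generic luckiness function, K(x) + log P(x), and the definition of algorithmic information, K(x) - K(x|P); the symbol $\mathcal{L}$ for the luckiness function is a notational assumption.

```latex
\begin{align*}
\mathcal{L}(P, x) - I(P : x)
  &= \bigl[K(x) + \log P(x)\bigr] - \bigl[K(x) - K(x|P)\bigr] \\
  &= K(x|P) + \log P(x).
\end{align*}
```

Hence each bound on the difference between luckiness and algorithmic information is the corresponding bound on the conditional complexity plus log-probability, restated.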



III. Bayes is optimal for almost all parameters

Adjust the programs for computing $x$ from its Shannon-Fano
codeword so that they use a built-in subroutine for
computing $x \mapsto P(x)$ written on tape $\alpha$ rather than read
the definition of this mapping from tape $\beta$. Then we have:

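Theorem 3.1: [9, Theorem 8.1.1] For a computer-dependent
constant $A$,

$$K(x) + \log P(x) \le A + K(P), \tag{14}$$
$$\mathbf{E}_{x \sim P} \left[ K(x) + \log P(x) \right] \ge 0. \tag{15}$$

Theorem 3.2: For a computer-dependent constant $A$,

$$K(x^n) + \log \mathbf{P}(x^n) \le A + K(\mathbf{P}) + K(n), \tag{16}$$
$$\mathbf{E}_{\mathbf{x} \sim \mathbf{P}} \left[ K(x^n) + \log \mathbf{P}(x^n) \right] \ge 0. \tag{17}$$

Moreover,

$$K(x^n) + \log \mathbf{P}(x^n) \ge 0 \tag{18}$$

$n$-ultimately for $\mathbf{P}$-almost all sequences $\mathbf{x}$.

There are several simple corollaries of Theorem 3.2.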

Definition 3.3 (Barron random sequence): A sequence $\mathbf{x}$
will be called $\mathbf{P}$-Barron random if (18) holds $n$-ultimately
for $\mathbf{x}$. The set of such sequences will be denoted as
$\mathcal{B}_{\mathbf{P}}$.

Definition 3.4 (Bayesian code): The Bayesian code with
respect to $(\{\mathbf{P}_\theta\}, \pi, c)$ is the mapping
$C : \mathbb{X}^+ \ni x \mapsto C(x) = c(|x|)\mathbf{C}(x) \in \mathbb{Y}^+$,
where $c : \mathbb{N} \to \mathbb{Y}^+$ is a code
for natural numbers, $\mathbf{C}(x)$ is the Shannon-Fano codeword for
$x$ with respect to $\mathbf{P}(x) = \int \mathbf{P}_\theta(x) \, d\pi(\theta)$,
$\{\mathbf{P}_\theta : \theta \in \Theta\}$ is
a family of probability measures, and $\pi$ is a prior probability.

Corollary 3.5: If the measure $\mathbf{P}$ and code
$c : \mathbb{N} \to \mathbb{Y}^+$ are computable then the Bayesian code
with respect to $(\{\mathbf{P}_\theta\}, \pi, c)$ is
$(\mathcal{B}_{\mathbf{P}}, |c(n)| + 1)$-superuniversal.

Proof: Of course, the hypothesis implies that $C$ is a computable
prefix code. We have $|C(x)| = |c(|x|)| + \lceil -\log \mathbf{P}(x) \rceil$.
If (18) holds then $|C(x^n)| - K(x^n) < |c(n)| + 1$. So $C$ is
$(\mathcal{B}_{\mathbf{P}}, |c(n)| + 1)$-superuniversal. ■
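
The length of a Bayesian code can be computed in closed form in simple cases. The sketch below assumes D = 2, the iid Bernoulli family with the uniform prior on the parameter (so that the mixture probability of a string with k ones among n symbols is k!(n-k)!/(n+1)!), and the Elias omega code as the code for the string length; all three choices and the function names are illustrative assumptions.

```python
import math

def omega_length(n):
    """Length in bits of the Elias omega codeword of n >= 1."""
    length = 1  # terminating zero
    while n > 1:
        bits = n.bit_length()
        length += bits
        n = bits - 1
    return length

def bayes_code_length(bits):
    """|c(n)| + ceil(-log2 P(x^n)) for the uniform-prior Bernoulli mixture."""
    n, k = len(bits), sum(bits)
    # Mixture probability: P(x^n) = k! (n-k)! / (n+1)! = 1 / ((n+1) C(n,k)).
    neg_log_p = math.log2((n + 1) * math.comb(n, k))
    return omega_length(n) + math.ceil(neg_log_p)

x = [1, 1, 0, 1, 1, 1, 0, 1]
print(bayes_code_length(x))
```

The two summands mirror the decomposition of the code length used in the proof: the first identifies the string length and the second is the Shannon-Fano codeword length with respect to the mixture.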

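Barron randomness is a refinement of the better known
concept of algorithmic randomness of sequences. Let us recall
that sequence $\mathbf{x}$ is $\mathbf{P}$-Martin-Löf random if and
only if

$$K(x^n) + \log \mathbf{P}(x^n) \ge -c \tag{19}$$

for some $c > 0$ and all $n$ [9, Definition 2.5.4 and Theorem
3.6.1]. Denote the set of these sequences as $\mathcal{C}_{\mathbf{P}}$.
We have $\mathcal{C}_{\mathbf{P}} \supset \mathcal{B}_{\mathbf{P}}$ so
$\mathbf{P}(\mathcal{C}_{\mathbf{P}}) = \mathbf{P}(\mathcal{B}_{\mathbf{P}}) = 1$.
If $\mathbf{x} \in \mathcal{C}_{\mathbf{P}} \setminus \mathcal{B}_{\mathbf{P}}$,
however, the catch-up time $\mathrm{CUT}(\mathbf{x}; C)$ is infinite for
$|C(x)| = |c(|x|)| + \lceil -\log \mathbf{P}(x) \rceil$.

In the next step we will interpret the set of Barron random
sequences $\mathcal{B}_{\mathbf{P}}$ as a superset of sequences typical
of certain not necessarily computable measures $\mathbf{P}_\theta$.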

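Corollary 3.6: Consider a probability measure of form
$\mathbf{P}(x) = \int \mathbf{P}_\theta(x) \, d\pi(\theta)$ for any
measurable parameterization $\Theta \ni \theta \mapsto \mathbf{P}_\theta$
where both prior $\pi$ and $\mathbf{P}_\theta$ are probability
measures. Equality $\mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) = 1$
holds for $\pi$-almost all $\theta$.

Proof: Let $G_n := \{\theta \in \Theta : \mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) \ge 1 - 1/n\}$.
By Theorem 3.2,
$1 = \mathbf{P}(\mathcal{B}_{\mathbf{P}}) \le \pi(G_n) + \pi(\Theta \setminus G_n)(1 - 1/n) = 1 - n^{-1} \pi(\Theta \setminus G_n)$.
Thus $\pi(G_n) = 1$. Finally, we appeal to
$\sigma$-additivity of $\pi$. For
$G := \{\theta \in \Theta : \mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) = 1\} = \bigcap_n G_n$
we obtain $\pi(G) = \inf_n \pi(G_n) = 1$. ■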

Corollaries 3.5 and 3.6 demonstrate that the ultimate redundancy
of a Bayesian code is nearly optimal when compared
with any computable code on data typical of noncomputable
parameterized measures $\mathbf{P}_\theta$. This statement holds for any
imaginable statistical model $\{\mathbf{P}_\theta : \theta \in \Theta\}$.
Computability of the Bayesian code is the only restriction and the
only caveat is that $\mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) = 1$
holds for prior-almost all parameters $\theta$
rather than for all of them.

Example 3.7 (a code for "almost all" distributions): This
example stems from the observation that we can encode any
probability measure on $\mathbb{X}^\infty$ with a single infinite
sequence $\theta = \theta_1 \theta_2 \theta_3 \dots$ over the alphabet
$\mathbb{Y} = \{0, 1, \dots, D-1\} \ni \theta_n$.^5

For simplicity let the input alphabet be the set of natural
numbers, $\mathbb{X} := \mathbb{N}$. The link between $\theta$ and
a measure $\mathbf{P}_\theta$ will be established by imposing equality

$$\mathbf{P}_\theta(x^n) = \dots \tag{20}$$

where $\mathbf{P}_\theta(\lambda) = 1$ for the empty word $\lambda$ and
a bijection $\mathbb{N}^+ \times \mathbb{N} \to \mathbb{N}$ is used.
It is easy to see that $\mathbf{P}_\theta$ is a probability
measure on $\mathbb{X}^\infty$ for each $\theta$. Conversely, each
probability measure on $\mathbb{X}^\infty$ equals $\mathbf{P}_\theta$
for at least one $\theta$.

Let the prior be the uniform iid measure $\pi(\theta^m) := D^{-m}$,
$\theta^m := \theta_1 \theta_2 \dots \theta_m$. The Bayesian measure
$\mathbf{P}(x) = \int \mathbf{P}_\theta(x) \, d\pi(\theta)$ is computable.
Consequently, the Bayesian code with respect to
$(\{\mathbf{P}_\theta\}, \pi, c)$ is computable and
$(\mathcal{B}_{\mathbf{P}}, |c(n)| + 1)$-superuniversal.

Whereas parameterization (20) is general, the measure $\mathbf{P}$
introduced in this example is simply given by
$\log_2 \mathbf{P}(x^n) = -\sum_{i=1}^n x_i$. Although
$\mathbf{P}_\theta(\mathcal{B}_{\mathbf{P}}) = 1$ for $\pi$-almost all
$\theta$, the Bayesian code with respect to this $\mathbf{P}$ is
suboptimal for stationary measures different from $\mathbf{P}$.

IV. Conclusion 

We hope that our simple insights may be used in future 
research to better characterize several paradoxical phenomena 
that have haunted the emerging MDL statistics. These phe- 
nomena are: nonexistence of universal redundancy rates, su- 
perefficient compression/estimation, converging and diverging 
Bayesian predictors, and various "catch-up" phenomena. 

It is important to understand for which particular parameters
the claim of Corollary 3.6 holds or fails. Inspired by [17] and
[4], we have started contemplating the following problem:

Question 4.1: Consider a computable measure
$\mathbf{P}(x) = \int \mathbf{P}_\theta(x) \, d\pi(\theta)$, where
parameter values $\theta$ are infinite sequences as well. Let
$\mathcal{X}$ be the set of sequences for which the
Bayesian code with respect to $(\{\mathbf{P}_\theta\}, \pi, c)$ is
$(\mathcal{X}, |c(n)| + 1)$-superuniversal. What does
$\mathbf{P}_\theta(\mathcal{X})$ equal for $\theta$ that (i) are
algorithmically random, or (ii) exhibit a deficiency of
algorithmic randomness (e.g. they are computable)?

^5 One can also put $\theta = \sum_{k=1}^\infty \theta_k D^{-k}$ since
the set of real numbers having two different $D$-ary expansions is
negligible.



We have already proved that $\mathbf{P}_\theta(\mathcal{X}) = 1$ for (i)
whereas $\mathbf{P}_\theta(\mathcal{X}) = 0$ for (ii) under some natural
conditions, e.g., for exponential iid distributions $\mathbf{P}_\theta$.

The second group of interesting open problems concerns 
catch-up times. Can we know the catch-up times approxi- 
mately? How can we use this knowledge to verify or to falsify 
a statistical model for concrete data of limited length? 